MRG-DBSCAN: An Improved DBSCAN Clustering Method Based on Map Reduce and Grid
نویسندگان
چکیده
DBSCAN is a density-based clustering algorithm. This algorithm clusters data of high density. The traditional DBSCAN clustering algorithm in finding the core object, will use this object as the center core, extends outwards continuously. At this point, the core objects growing, unprocessed objects are retained in memory, which will occupy a lot of memory and I/O overhead, algorithm efficiency is not high. In order to ensure the high efficiency of DBSCAN clustering algorithm, and reduce its memory footprint. In this paper, the original DBSCAN algorithm was improved, and the G-DBSCAN algorithm is proposed. G-DBSCAN algorithm reduces the number of query object as a starting point. Put the data into the grid, with the center point of the data in the grid to replace all the grid points as the algorithm input. The query object will be drastically reduced, thus improving the efficiency of the algorithm, reduces the memory footprint. In order to make the G-DBSCAN algorithm can adapt to large data processing, we will parallelize the G-DBSCAN algorithm, and combining it with Map Reduce framework. The results prove that G-DBSCAN and MRG-DBSCAN algorithm are feasible and effective.
منابع مشابه
An Efficient Density-based Clustering Algorithm for Higher-Dimensional Data
DBSCAN is a typically used clustering algorithm due to its clustering ability for arbitrarily-shaped clusters and its robustness to outliers. Generally, the complexity of DBSCAN is O(n) in the worst case, and it practically becomes more severe in higher dimension. Grid-based DBSCAN is one of the recent improved algorithms aiming at facilitating efficiency. However, the performance of grid-based...
متن کاملGrid-based Data Stream Clustering for Intrusion Detection
As a kind of stream data mining method, stream clustering has great potentiality in areas such as network traffic analysis, intrusion detection, etc. This paper proposes a novel grid-based clustering algorithm for stream data, which has both advantages of grid mapping and DBSCAN algorithm. The algorithm adopts the two-phase model and in the online phase, it maps stream data into a grid and the ...
متن کاملA New Clustering Algorithm Based on Near Neighbor Influence
Clustering has been used in many areas. It is an unsupervised learning method which tries to find some distributions and patterns in unlabeled data sets. Although clustering algorithms have been studied for decades, none of them is all purpose. This paper presents a new clustering algorithm, Clustering based on Near Neighbor Influence (CNNI), an improved version in time cost of CNNI algorithm (...
متن کاملبررسی مشکلات الگوریتم خوشه بندی DBSCAN و مروری بر بهبودهای ارائهشده برای آن
Clustering is an important knowledge discovery technique in the database. Density-based clustering algorithms are one of the main methods for clustering in data mining. These algorithms have some special features including being independent from the shape of the clusters, highly understandable and ease of use. DBSCAN is a base algorithm for density-based clustering algorithms. DBSCAN is able to...
متن کاملImprovement of density-based clustering algorithm using modifying the density definitions and input parameter
Clustering is one of the main tasks in data mining, which means grouping similar samples. In general, there is a wide variety of clustering algorithms. One of these categories is density-based clustering. Various algorithms have been proposed for this method; one of the most widely used algorithms called DBSCAN. DBSCAN can identify clusters of different shapes in the dataset and automatically i...
متن کامل